Abstract. Consider a fully connected network where up to t processes
may crash, and all processes start in an arbitrary memory state. The 
self-stabilizing firing squad problem consists of eventually guaranteeing 
simultaneous response to an external input. This is modeled by requir- 
ing that the non-crashed processes "fire" simultaneously if some correct 
process received an external "GO" input, and that they only fire as a 
response to some process receiving such an input. This paper presents 
Fire-Squad, the first self-stabilizing firing squad algorithm. 

The Fire-Squad algorithm is optimal in two respects: (a) Once the 
algorithm is in a safe state, it fires in response to a GO input as fast as 
any other algorithm does, and (b) Starting from an arbitrary state, it 
converges to a safe state as fast as any other algorithm does. 

1 Introduction 

The firing squad problem was first introduced in [2,3]. Informally, it is 
assumed that at any given round a process may receive an external "GO" 
input, which is considered a request for the correct processes to simul- 
taneously "fire." Roughly, a good solution is a protocol satisfying three 
properties: (a) if some process fires in round r then all the non-crashed 
processes fire simultaneously in round r; (b) if a correct process receives 
a GO input in round r' then it will fire at some later round r > r'; and
(c) a process fires in round r only if some process received a GO input in 
some round r' < r. (The formal definition disallows a solution in which a 
single input induces a constant firing.) 

Requiring the processes to fire simultaneously captures an important 
aspect of distributed systems: There are cases in which it is important 
that activities begin in the same round, e.g., when one distributed algo- 
rithm ends and another one begins, and the two may interfere with each 
other if executed concurrently. Similarly, many synchronous algorithms
are designed assuming that all sites start participating in the same round 
of communication. Finally, simultaneity may be motivated by the fact 
that a distributed system interacts with the outside world, and these in- 
teractions should often be simultaneously consistent. A non-simultaneous 
announcement to financial (stock) markets may enable unfair arbitrage 
trading, for example. 

Coordinating simultaneous actions is not subsumed by the consensus 
task. Indeed, even when no transient failures are considered possible (so 
there is a global clock and no self-stabilization is required), solving the 
firing squad problem or simultaneously deciding in a consensus task can 
be considerably harder than plain consensus [4,8]. This implies, in partic- 
ular, that clock synchronization [7,11,6,12,18] does not suffice for solving 
the firing squad problem in a self-stabilizing manner; as it can be seen 
as providing round-numbers to a self-stabilizing environment, which still 
leaves the firing squad problem as a non-trivial problem. 

The firing squad problem is a primary example of a problem requiring
simultaneously coordinated actions by the non-faulty processes.
Simultaneous coordination has been shown to be closely related to the
notion of common knowledge [10,9], and this connection has been used 
to characterize the earliest time required to reach simultaneous consen- 
sus, firing squad, and related problems in a variety of failure models 
[8,15,1,17,13,16]. One of the consequences of this literature is the fact 
that the time at which a simultaneous action that is based on initial val- 
ues or external inputs can be performed depends in a crucial way on the 
pattern in which failures occur. 

A general form of simultaneous agreement called continuous consensus 
was defined in [13]. In this problem, each of the processes maintains a list 
of events of interest that have taken place in the run, and it is guaranteed 
that the lists at all non-faulty processes are identical at all times. The
authors of [13] present an optimal (non-stabilizing) implementation of such
a service, a protocol called ConCon. If we define the events to be
monitored by ConCon to be of the form (GO, p, k), corresponding to
a GO message arriving at process p at the end of round k, then a firing
squad protocol can be obtained from ConCon simply by having the
non-faulty processes fire exactly when a (GO, p, k) event first appears in their
identical copies of the "common" list. We shall refer to this solution to
the firing squad problem based on ConCon as CCfs.

Traditionally, the firing squad problem assumes that processes do not 
recover, i.e., failed processes stay failed forever. Moreover, even though 
it is easy to extend the firing squad problem so that it can be repeatedly
executed (i.e., allow for multiple firings over time, given that multiple
GO inputs are received), it assumes that nothing in the system goes
amiss — except possibly for the crash failures being accounted for. Adding 
support for handling transient faults increases the robustness of a firing 
squad algorithm in this aspect. Indeed, a self-stabilizing solution will, in
particular, be able to cope with process recovery: Following process
recoveries, the system will eventually converge to a valid state and continue
operating correctly. 

Transient faults alter a process's memory state in an arbitrary way. A 
self-stabilizing algorithm [5] is assumed to start in an arbitrary state and 
be guaranteed to eventually reach a state from which it operates according 
to its intended specification. Starting the operation at an arbitrary state 
enables the adversary to "plant" false information, such as the receipt of 
GO messages in the past, which can cause the algorithm to unjustifiably 
fire, either immediately, or within a few rounds. One of the challenges in 
designing an efficient self-stabilizing firing squad algorithm is in bounding 
the damage that can be caused by such false information in the initial 
state. 

Perhaps the first candidate solution would be to initiate an instance 
of CCfs in every round, with t + 1 instances executing concurrently at 
any given time, where t is an upper bound on the number of possible 
crashed processes. Firing would then take place if it is dictated by any 
of the instances. Since the component instances of such a solution are 
not themselves stabilizing, all we can show is that such a solution is 
guaranteed to stabilize after t + 1 rounds, regardless of the failure pattern. 
We shall present a solution that does not consist of such a concurrent 
composition. Moreover, it performs subtle consistency checks to restrict 
the impact of false information that appears in the initial state. As a 
result, in some cases we obtain stabilization in as little as two rounds. 

The above discussion points out the stabilization time as an impor- 
tant aspect of a self-stabilizing firing squad algorithm. Another central 
performance parameter is its swiftness: Once the algorithm has stabilized, 
how fast does it fire given that some process receives a GO input? In ad- 
dition to solving the self-stabilizing firing squad problem, the algorithm 
presented in this paper is also optimal in terms of both its stabilization 
time, and its swiftness. 
The main contributions of this paper are: 

- A self-stabilizing variant of the firing squad problem is defined, and 
an algorithm solving it in the case of crash failures is given. 
- The proposed algorithm, called Fire-Squad, is shown to be optimal
both in terms of the time it requires to stabilize and in terms of the 
time it takes, after stabilization, to fire in response to a GO input. 

- Finally, the optimality is demonstrated in a fairly strong sense: For 
every possible failure pattern, both stabilization time and swiftness 
are the fastest possible, in any correct algorithm. In extreme cases 
this enables stabilization in two rounds and firing in one round. 

The rest of the paper is organized as follows. Section 2 describes the 
model and defines the problem at hand. Section 3 provides lower bounds 
for the optimality properties. Section 4 describes the proposed solution, 
Fire-Squad, and proves its correctness and optimality. Finally, Section 5 
concludes with a discussion. 

2 Model and Problem Definition 

The system consists of a set V = {1, . . . , n} of processes. Communication 
is done via message passing, and the network is synchronous and fully 
connected. The system starts out at time k = 0, and a communication
round r starts at time k = r - 1 and ends at time k = r. At time k each
process computes its state according to its state at time k - 1, the internal
messages it received by time k (sent by other processes at time k - 1)
and external inputs (if any) that it received at time k. In addition, at any
time k ≥ 0 a process can produce an external output (such as "firing").

Let $I_p^k \in \{0, 1\}$ represent the external input of process p at time k. We
say that p received an external GO input at time k if $I_p^k = 1$; otherwise
(if $I_p^k = 0$), we say that p did not receive a GO input. Let $I_p = \{I_p^k\}_{k=0}^{\infty}$,
let $I^k = \{I_p^k\}_{p=1}^{n}$ and let $I = \{I_p\}_{p=1}^{n}$. I is "the input pattern", and $I^k$ is
the (joint) input at time k. In a similar manner define $O_p^k \in \{0, 1\}$, $O_p$, $O^k$
and O as the output pattern. If $O_p^k = 1$ we say that p fires at time k,
and if $O_p^k = 0$ we say p does not fire at time k. It will be convenient to
say that a fire action occurs at time k if $O_p^k = 1$ for some process p, and
similarly that a GO input is received at time k if $I_p^k = 1$ for some p.

Denote by t an a priori bound on the number of faulty processes in 
the system. For ease of exposition, we assume that t < n - 1, so that
there are at least two processes that need to coordinate their actions. We 
assume the crash failure model, in which a faulty process p does not send 
any messages after its failing round; it behaves correctly before its failing 
round, and sends an arbitrary subset of its intended messages during its
failing round. 

A failure pattern describes for each time k which processes have failed
by time k, and for each process that fails in round k (i.e., did not fail by
time k - 1), which of its outgoing communication channels are blocked
(and hence do not deliver its messages) in round k. Notice that a process
may fail in round k even if all of its messages are delivered. We denote
a failure pattern by F, and by $F^k$ the set of processes that fail in F
by time k. Observe that $F^k \subseteq F^{k+1}$, since in the crash failure model failed
processes do not recover. Similarly, we use $G^k = V \setminus F^k$ to denote the
set of processes that are non-faulty at time k. Finally, G will denote the
set of processes that remain non-faulty throughout F, i.e., $G = \bigcap_{k=0}^{\infty} G^k$.
Notice that the set G is always defined in terms of a failure pattern F,
which is typically clear from the context.

In addition to crashes, there are also transient faults. Formally, we 
denote by $S_p^k$ the state of a process p at time k. We denote by
$S^k = (S_1^k, \ldots, S_p^k, \ldots, S_n^k)$ the state of the entire system at time k. Transient
faults are captured by the assumption that the system may start from
any (arbitrary) state, and there is some round r such that for all rounds
r' > r the intended algorithm operates as written. In other words, for any
possible state S, if $S^0 = S$ then eventually (starting from some round r)
the algorithm operates correctly. 

For the following analysis, each algorithm A is assumed to have an 
initial state $S_A^{init}$. For self-stabilizing algorithms, we fix an arbitrary state
as $S_A^{init}$ (as the algorithm should converge starting from any initial state).
The a priori bound of t on the number of failures is assumed to be hard-wired
into the algorithm, and is not affected by transient faults. Such
an algorithm is assumed to be executed only in the context of failure
patterns in which at most t processes crash. For such failure patterns F,
the algorithm A produces an output pattern O starting from state S
given an input I; we denote this output pattern by O = A(S, I, F).

Informally, the Firing Squad problem requires that: (1) all processes
fire together ("simultaneity"); (2) if a GO input is received then a fire
action occurs ("liveness"); and (3) the number of fire actions is not larger
than the number of received GO inputs ("safety"). Formally,

Definition 1. Let O = A(S, I, F) and let G denote the set of processes
that remain non-faulty throughout F. We say that O satisfies the
FS(k) properties (capturing correct firing-squad behavior from time k on)
w.r.t. I, F, and O, if the following conditions hold for all k' ≥ k:
1. (simultaneity) If $O_p^{k'} = 1$ for some p ∈ V then $O_q^{k'} = 1$ for all q ∈ G;

2. (liveness) If $I_p^{k'} = 1$ for some p ∈ G, then there is k'' ≥ k' s.t. $O_p^{k''} = 1$;

3. (safety) The number of times k'' satisfying k ≤ k'' ≤ k' at which a
fire action occurs is not larger than the number of times h in
the range 0 ≤ h ≤ k' at which GO inputs are received.
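
As an illustration only, the three FS(k) conditions can be checked mechanically on a finite recorded run. The following Python sketch is not part of the paper: the names Pattern and satisfies_fs are hypothetical, patterns are assumed to be lists indexed as pattern[k][p], the inequalities of Definition 1 are read as non-strict, and liveness is only checked up to the end of the recorded trace.

```python
from typing import List, Set

# pattern[k][p] in {0, 1}: input (or output) of process p at time k (assumed encoding)
Pattern = List[List[int]]


def fires_at(outputs: Pattern, k: int) -> bool:
    """A fire action occurs at time k if some process fires."""
    return any(outputs[k])


def go_at(inputs: Pattern, k: int) -> bool:
    """A GO input is received at time k if some process receives one."""
    return any(inputs[k])


def satisfies_fs(k: int, inputs: Pattern, outputs: Pattern, good: Set[int]) -> bool:
    """Check the FS(k) conditions on a finite trace (inputs and outputs of equal length).

    Assumption: liveness is judged against the recorded horizon only, so a GO
    that is not answered before the trace ends is reported as a violation.
    """
    horizon = len(outputs)
    for k1 in range(k, horizon):
        # simultaneity: if anyone fires at k1, every process in G fires at k1
        if fires_at(outputs, k1) and not all(outputs[k1][q] for q in good):
            return False
        # liveness: a GO at p in G is followed by p firing at some k2 >= k1
        for p in good:
            if inputs[k1][p] and not any(outputs[k2][p] for k2 in range(k1, horizon)):
                return False
        # safety: firing times in [k, k1] never outnumber GO times in [0, k1]
        fire_count = sum(fires_at(outputs, k2) for k2 in range(k, k1 + 1))
        go_count = sum(go_at(inputs, h) for h in range(0, k1 + 1))
        if fire_count > go_count:
            return False
    return True
```

On a trace where a "phantom" firing occurs at time 0 before any GO input, satisfies_fs(0, ...) fails on safety while satisfies_fs(1, ...) can still hold, which matches the intuition that the properties are only required from the stabilization time on.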

We can use the FS(k) properties to define when an algorithm solves
the firing squad problem in a self-stabilizing manner. We first use them to
define the stabilization time of an algorithm as follows:

Definition 2 (Stabilization time). The stabilization time of A on S,
I and F, denoted by stab(A, S, I, F), is the minimal k ≥ 0 such that
FS(k) holds with respect to I, F, and O = A(S, I, F). (If FS(k) holds for
no finite k, then stab(A, S, I, F) = ∞.)

Notice that the "safety" property in FS(k) relates outputs starting
from time k to inputs starting from time 0. The reason is as follows: since we
consider time 0 to be the point at which transient errors end, if the system starts in
a state in which "it appears as if" GO inputs were received before time 0,
the good processes may fire after time 0 without a GO message actually
having been received. Once all firings induced by such "phantom" GO
inputs have occurred, we can legitimately require firing events to happen
only in response to genuine GO message receipts. We thus think of the
stabilization time, at which in particular the safety property of FS(k)
holds, as one after which no firing will occur in response to phantom GO 
messages. Rather, every firing will be justifiable as a response to some GO 
message received at or after time 0. 

Definition 3 (SSFS Algorithm). An algorithm A solves the Self-stabilizing
Firing Squad problem (A is an SSFS algorithm, for short) if there
exists a k < ∞ such that stab(A, S, I, F) ≤ k for every system state S,
input pattern I and failure pattern F.

Observe that in a setting with no transient faults, an algorithm A
solves the (non-self-stabilizing) Firing Squad problem if it satisfies FS(0)
with respect to I, F, and O, for every I, F and O = $A(S_A^{init}, I, F)$.

Notice that Definition 3 implies that any SSFS algorithm A has at
least one memory state from which the firing squad properties are guaranteed
to hold. Denote one of these memory states by $S_A^{stab}$, or simply
$S^{stab}$ when A is clear from the context.
2.1 Optimality Measures

In this work we are interested in finding an optimal SSFS algorithm. 
We start by defining stabilization time optimality, which measures how 
quickly algorithm A stabilizes. 

Definition 4. An SSFS algorithm A is said to optimally stabilize if the
following holds for every SSFS algorithm B and every failure pattern F:

$\max_{S,I}\{\mathrm{stab}(A, S, I, F)\} \le \max_{S,I}\{\mathrm{stab}(B, S, I, F)\}$.

Definition 4 defines optimality of an algorithm A with respect to its 
stabilization time, i.e., how quickly A starts to operate according to all 
of the FS requirements. The intuition behind defining optimality in terms 
of worst-case S and I is to avoid algorithms that are "specific" to an
initial memory state or input pattern. Thus, by requiring optimality in 
the worst-case we ensure that the algorithm cannot be hand-tailored to a 
specific setting, but rather needs to solve the SSFS problem in a "generic" 
manner. 

We now turn to the issue of comparing the responsiveness of distinct 
firing squad algorithms. Specifically, we are concerned with how quickly 
an algorithm fires after a GO message is received (once the algorithm 
has stabilized). For simplicity, we consider receipts of GO by non-faulty
processes, since the problem specification forces a firing following such 
a receipt. Another subtle issue is that if GO messages are received in 
different rounds between which there is no firing, then it may be difficult 
to figure out which GO message the next firing is responding to. Again 
for simplicity, we will be interested in what will be called sequential input 
patterns, in which a GO is not received before all previous GOs have been
followed by firings. More formally, we define: 

Definition 5 (Sequential inputs). Let A be an SSFS algorithm. We
say that the input I is sequential with respect to (A, S, F) if (i) no GO
inputs are received according to I at times k < stab(A, S, I, F), (ii) GO
inputs are received in I only by processes from G, and (iii) if $k_1 < k_2$ and
GO inputs are received at both $k_1$ and $k_2$, then there is an intermediate
time $k_1 < k' < k_2$ at which a fire action occurs.

The following definition formally captures the number of firing events 
that occur between the stabilization time and a given time k. 

Definition 6. Let A be an SSFS algorithm and let O = A(S, I, F).
We define #[(A, S, I, F), k] to be the number of rounds k' in the range
stab(A, S, I, F) ≤ k' ≤ k such that $O_p^{k'} = 1$ holds for some process p
(i.e., a firing occurs at time k').

By definition, if k < stab(A, S, I, F) then #[(A, S, I, F), k] = 0.
With the last two definitions, we are now able to formally compare the 
responsiveness of different SSFS algorithms: 

Definition 7 (Swiftness). Let A and B be SSFS algorithms. We say 
that A is at least as swift as B if A fires at least as quickly as B on all 
sequential inputs. Formally, we require that for every failure pattern F,
input I, and states $S_A$ of A and $S_B$ of B, the following holds. If I is
sequential both with respect to (A, $S_A$, F) and with respect to (B, $S_B$,
F), then #[(A, $S_A$, I, F), k] ≥ #[(B, $S_B$, I, F), k] holds for every time k.
An SSFS algorithm A is optimally swift if it is at least as swift as B for 
every SSFS algorithm B. 
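
To make the counting in Definitions 6 and 7 concrete, here is a small Python sketch. It is an illustration under assumptions, not the paper's method: the function names are hypothetical, an output pattern is assumed to be a list with outputs[k][p] in {0, 1}, and the comparison looks at one pair of recorded runs only, whereas Definition 7 quantifies over every failure pattern, sequential input and pair of states.

```python
from typing import List

Pattern = List[List[int]]  # outputs[k][p] in {0, 1} (assumed encoding)


def count_firings(outputs: Pattern, stab: int, k: int) -> int:
    """#[(A, S, I, F), k]: number of times k' with stab <= k' <= k at which
    some process fires. Returns 0 when k < stab, since the range is empty."""
    return sum(1 for k1 in range(stab, k + 1) if any(outputs[k1]))


def at_least_as_swift_on_run(outputs_a: Pattern, stab_a: int,
                             outputs_b: Pattern, stab_b: int) -> bool:
    """Compare two recorded runs on the same input and failure pattern:
    A has fired at least as often as B by every recorded time k."""
    horizon = min(len(outputs_a), len(outputs_b))
    return all(
        count_firings(outputs_a, stab_a, k) >= count_firings(outputs_b, stab_b, k)
        for k in range(horizon)
    )
```

A run that fires one round earlier than another, with equal stabilization times, passes the comparison in one direction and fails it in the other.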

We are now in a position to state the main result of the paper: The 
Fire-Squad algorithm of Figure 1 is an SSFS algorithm (Theorem 3), is 
optimally stabilizing (Theorem 4) and is optimally swift (Theorem 5). 

3 Lower Bounds 

In this section we provide lower bounds for the stabilization time and for 
the swiftness of any SSFS algorithm A. The lower bounds build upon 
previous results in the field of simultaneous agreement. 

Recall that if A is a non-self-stabilizing Firing Squad algorithm, then 
stab(A, $S_A^{init}$, I, F) = 0 for all I and F. Therefore, in the
non-self-stabilizing case, it only makes sense to compare algorithms in terms of
their "swiftness." In a non-self-stabilizing setting, the firing squad proto- 
col CCfs (based on ConCon [13]) is optimally swift. We will use it as a 
benchmark and yardstick for expressing and analyzing the performance 
of self-stabilizing firing squad protocols. To compare the performance of 
different algorithms, we make use of the following definitions. 

Definition 8. We denote by δ(F, k) the number of processes known at
time k to be faulty by the processes in $G^k$ in a run of CCfs with failure
pattern F.

Intuitively, δ(F, k) stands for the number of failures that are discovered by
time k in a run with pattern F. We remark that δ(F, k) is well-defined,
because the same number of faulty processes are discovered (at the same
times) in all runs of CCfs that have failure pattern F. Moreover, since
CCfs detects failures as a full-information protocol does, no algorithm A
can discover more failed processes than CCfs does (see [8]). Thus, δ(F, k)
is an upper bound on the number of failed processes discovered by time k
by any algorithm A. 

CCfs makes essential use of a notion of horizon, which is roughly the 
time by which past events are guaranteed to become common knowledge. 
This motivates the following definitions. 

Definition 9 (Horizons). Given a failure pattern F, the horizon distance
at time k, denoted by disH(F, k), is t + 1 - δ(F, k). The absolute
horizon at time k, denoted absH(F, k), is k + disH(F, k).

While the absolute horizon is an upper bound on when events become 
common knowledge, the publication time is a lower bound on this time. 
It is defined as follows: 

Definition 10 (Publication Time). Given a failure pattern F, the
publication time for (time) k, denoted by π(F, k), is $\min_{k' \ge k}\{\mathrm{absH}(F, k')\}$.

When F is clear from the context, it will be omitted from δ(k),
disH(k), absH(k) and π(k).
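
The horizon definitions are simple arithmetic, and a short Python sketch may help in following them. It is a hedged illustration: the list delta (with delta[k] standing for the number of failures discovered by time k) and all function names are assumptions made here, and delta is assumed to stay constant after its last recorded entry.

```python
from typing import List


def delta_at(delta: List[int], k: int) -> int:
    """Number of failures discovered by time k; assumed constant past the end."""
    return delta[min(k, len(delta) - 1)]


def dis_h(t: int, delta: List[int], k: int) -> int:
    """Horizon distance at time k: t + 1 - delta(k)."""
    return t + 1 - delta_at(delta, k)


def abs_h(t: int, delta: List[int], k: int) -> int:
    """Absolute horizon at time k: k + disH(k)."""
    return k + dis_h(t, delta, k)


def publication_time(t: int, delta: List[int], k: int) -> int:
    """Publication time for k: the minimum of absH(k') over all k' >= k.

    Because delta never exceeds t, absH(k') >= k' + 1, while absH(k) <= k + t + 1.
    Hence no k' > k + t can attain the minimum, and a finite scan suffices.
    """
    return min(abs_h(t, delta, k1) for k1 in range(k, k + t + 1))
```

With t = 3 and no discovered failures (delta = [0]), the sketch gives a publication time of 4 = t + 1 for time 0. If all three failures are discovered by time 1 (delta = [0, 3]), it gives 2, in line with the remark that stabilization can take as little as two rounds.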

As shown in [13], for a given failure pattern F, a GO input received
at time k is "common knowledge" not before time π(F, k). Thus, for a
specific algorithm A, the publication time for 0 bounds (from below) the
time k at which the first firing action can occur in O = $A(S^{stab}, I, F)$.

The publication time π(F, k) is a generalization of notions developed
in [8] for Simultaneous (single-shot, non-stabilizing) Consensus. In that
paper, a notion of the waste of F is defined, and information about initial
values (which can be viewed in our setting as being about external
inputs at time 0) becomes common knowledge at time t + 1 - waste. In
our terminology, this occurs precisely at the publication time π(F, 0) for
events of time 0.

The intuition behind the first lower bound is that if CCfs receives a
GO input at time 0, then it fires at time π(0) (Lemma 1). Since CCfs is
optimal, an SSFS algorithm A cannot fire faster. Therefore, if we consider
A starting in a memory state where A "thinks" it received a GO input 1
round ago, A will fire not before time π(0) - 1. The formal proof appears
in the proof of Theorem 1.

Lemma 1. Let F be any failure pattern and let I be an input pattern
for which $I_q^k = 0$ for every process q and time k ≥ 0, except for one process
p ∈ G for which $I_p^0 = 1$. The first fire action of O = $\mathrm{CCfs}(S_{CCfs}^{init}, I, F)$
occurs at time π(F, 0).
Proof. A result of the work done in [13].

Notation 1. For input I and an integer i ≥ 0, denote by I(i→) the input
pattern that is obtained by excluding the first i rounds of I. Formally,
$I(i{\to})^k = I^{k+i}$ for all k ≥ 0. Similarly denote F(i→) (w.r.t. F).

Lemma 2. Let F be a failure pattern. Let F' be a failure pattern with
no faults at time k = 0 and F'(1→) = F. Then π(F', 0) ≥ π(F, 0).

Proof. For every time k we have that δ(F', k) ≤ δ(F, k). Therefore,
absH(F', k) ≥ absH(F, k) holds for all k ≥ 0. Thus,
$\min_{k \ge 0}\{\mathrm{absH}(F', k)\} \ge \min_{k \ge 0}\{\mathrm{absH}(F, k)\}$, i.e., π(F', 0) ≥ π(F, 0). □

Following is the first lower bound result, stating that the worst case
stabilization time of every SSFS algorithm A is at least π(0).

Theorem 1. $\max_{S,I}\{\mathrm{stab}(A, S, I, F)\} \ge \pi(F, 0)$ holds for every SSFS
algorithm A and every failure pattern F.

Proof. To prove this theorem, we find a state S and input I such that
stab(A, S, I, F) ≥ π(F, 0). Since A solves the SSFS problem there is a
memory state $S^{stab}$ from which all of the FS properties hold.

Let p ∈ G be a process that is non-faulty throughout F, and consider
the following input pattern I': for all q, k it holds that $I'^k_q = 0$ except for
$I'^0_p = 1$. Consider F' to be a failure pattern with no failures at time k = 0
(i.e., $F'^0 = \emptyset$) and F'(1→) = F for the rest. Due to "liveness", A's run
from $S^{stab}$ with input I' and failures F' will eventually fire; denote the
firing time as k (i.e., $O_p^k = 1$ for some process p).

By Lemma 1, π(F', 0) is the optimal time for simultaneous firing, and
since starting from $S^{stab}$ all properties hold, including "simultaneity", it
holds that k ≥ π(F', 0).

Consider memory state S of A after executing a single round with I'
as input and F' as failure pattern and $S^{stab}$ as starting memory state.
Consider the run of A from S with input I and failure pattern F. A
must fire at time k - 1, as it cannot distinguish the run from $S^{stab}$, I', F'
and from S, I, F. By Lemma 2, π(F', 0) ≥ π(F, 0), and therefore A
will not fire before time k - 1 ≥ π(F', 0) - 1 ≥ π(F, 0) - 1. However,
notice that I contains only "0" inputs, implying that "safety" does not
hold for A when starting from S with input I and failure F for the first
π(F, 0) - 1 rounds. I.e., "safety" can hold starting from time π(F, 0) and
on. Therefore, $\max_{S,I}\{\mathrm{stab}(A, S, I, F)\} \ge \pi(F, 0)$. □
Our second lower bound result, informally stating that any SSFS algorithm
cannot fire faster than CCfs, is captured by the following theorem.
(Notice that the claim is made with respect to sequential input patterns.) 

Theorem 2. Let A be an SSFS algorithm, I a sequential input, F a
failure pattern and O = $A(S^{stab}, I, F)$. For every k ≥ 0 for which a
GO input is received in $I^k$ there is no fire action in O during times k'
satisfying k < k' < π(F, k).

Proof. Suppose by way of contradiction there is such a time k', and consider
the earliest such time k' satisfying k < k' < π(F, k) for which a fire
action occurs in $O^{k'}$. Denote by $S^k$ the memory state of A at time k.

Since A started to run from $S^{stab}$, FS(0) holds with respect to I, F
and O. Since I is sequential, and k' is the minimal time for which O has
a fire action after time k, we have that I(k→) contains a GO input at
time 0 and does not contain a GO input until time k'. Therefore, O =
$A(S^k, I(k{\to}), F(k{\to}))$ will have its first fire action at time k' - k.

From CCfs's optimality and together with Lemma 1, A cannot fire
before time π(F(k→), 0). Thus k' - k ≥ π(F(k→), 0), leading to k' ≥
k + π(F(k→), 0). By definition of π and F(k→) we have that π(F, k) ≤
k + π(F(k→), 0), contradicting the assumption that k < k' < π(F, k). □

4 Solving SSFS 

The algorithm Fire-Squad in Figure 1 is an SSFS algorithm that is both 
optimally stabilizing and is optimally swift. For swiftness, the algorithm is 
based on the approach used in the CCfs algorithm, in which the horizon 
is computed by monitoring the number of failures that occur, and a firing 
action takes place when the receipt of a GO becomes common knowledge. 
The horizon computation at a process p makes use of reports that p 
receives from other processes regarding failures that they have observed. 
Following a transient fault, the state of a process may contain arbitrary 
(including false) information about failures. In the crash failure model, 
a process q will learn about (truly) crashed processes in the first round. 
Consequently, p will compute a correct horizon one round later, once it 
receives reports from all such processes. Roughly speaking, this can be 
used as a basis for a (nontrivial) solution that stabilizes within two rounds 
of the optimal time. 

In order to improve on the above and obtain an optimal algorithm, 
Fire-Squad employs a couple of subtle consistency checks. The first one
Algorithm Fire-Squad(t)

 0: do forever:                /* executed on process p at time k */
                               /* process p is unaware of the value of k */
 1:   receive all available (Requests_q, Failed_q, Views_q) messages from process q ∈ V;

      /* update variables according to messages of round k and external input */
 2:   set Requests[0] := $I_p^k$;
 3:   for 1 ≤ i ≤ t + 1: set Requests[i] := max_q {Requests_q[i - 1]};
 4:   set Failed' := ∪_q Failed_q;
 5:   set Failed := all processes that p did not hear from this round;
 6:   for 1 ≤ i ≤ t: set Views[i - 1] := min_q {Views_q[i]} + 1;

      /* calculate horizon at time k - 1 */
 7:   set Horizon := t + 1 - min{|Failed'|, |Failed|};   /* consistency check I */
 8:   set Views[Horizon - 1] := 1;
 9:   for 0 ≤ i ≤ t: set Views[i] := max{Views[i], Horizon - i};   /* check II */

      /* should we fire? */
10:   if for some i' ≥ Views[0] it holds that Requests[i'] = 1 then
11:     for i' ≤ i'' ≤ t + 1: set Requests[i''] := 0;
12:     do "Fire";
13:   fi;

      /* send round k + 1 messages to all processes */
14:   send (Requests, Failed, Views) to all;
15: od.

Clean up:
Requests contains only {0, 1} values. Views contains only values ∈ {0, ..., t + 1}.

Fig. 1. Fire-Squad: a self-stabilizing firing squad algorithm.



involves checking the information obtained from other processes regard- 
ing failures they observed before the current round started. In the crash 
failure model, every failure observed by q by time k - 1 must be directly
observable by p no later than time k. So if the set of failures reported
to p contains failures that p has not directly observed, then it must be
time k ≤ 1, and p will use the set of failures that it has directly observed
in computing the horizon, instead of the set of reported failures. A subtle 
proof shows that, in this case, the computed horizon works correctly if 
k = 1, which is crucial for the algorithm's stabilization optimality. The 
second consistency check is based on the fact that in normal operation the 
horizon distance is (weakly) monotone decreasing. The local state con- 
tains information about previous horizon computations, and our second 
consistency check forces it to satisfy weak monotonicity. 
We now turn to describe the details of Fire-Squad. The following
discussion and lemmas are stated w.r.t. the algorithm and its components.
For a variable var, we denote by $var_p^k$ the value of var at process p after
the computation step at time k.

Each process p has a vector $Requests_p[i]$, which represents p's information
about a GO input that was received by some process i time units ago and
has not been fulfilled yet. More precisely, if $Requests_p^k[i] = 1$, then
some process received a GO input at time k - i, and no firing action occurred
between time k - i + 1 and time k. The vector Requests contains
values for the previous t + 1 time units and the current time; a total of
t + 2 entries.

In addition, each process has a set Failed, which consists of the processes
it has seen to be failed in the current round. That is, at time k,
process p's $Failed_p^k$ set contains all processes that process p did not receive
messages from during round k (i.e., messages sent at time k - 1).
Failed' is the union of all Failed sets (as received from other processes) of
the previous round. That is, at time k, $Failed'^k_p$ is the union of $Failed_q^{k-1}$
as computed at time k - 1 by every process q that p received messages
from during round k.

Finally, each process keeps track of a vector Views. If $Views_p^k[i] = z$
it means that at time k + i, data from time k - z is common knowledge.
The vector Views contains t + 1 entries, for the current round and the 
coming t rounds. 

For ease of exposition every process p is assumed to send messages 
to itself. Moreover, a process executing the algorithm is unaware of the 
current round number. We refer to such rounds using numbers k etc. for 
ease of exposition in describing and analyzing the algorithm. 
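
The update rules of Figure 1 can be sketched as a single per-round function. The Python below is a minimal sketch under stated assumptions, not the authors' implementation: the State class and fire_squad_step name are hypothetical; the "clean up" rule is assumed to be applied by clamping just before the firing test; when several indices i' qualify in line 10, the smallest one is assumed; the received dictionary is assumed to include p's own message; and the function raises an error if more than t processes are silent, since the model rules that out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class State:
    requests: List[int]  # t + 2 entries, indices 0 .. t + 1
    failed: Set[int]     # processes not heard from in the latest round
    views: List[int]     # t + 1 entries, indices 0 .. t


def fire_squad_step(t: int, procs: Set[int], own: State, go_input: int,
                    received: Dict[int, State]) -> Tuple[State, bool]:
    """One iteration of the do-forever loop at process p (lines 1-14 of Fig. 1).

    received maps each sender q heard from this round to the triple it sent.
    Returns the new local state (also the message to send) and whether p fires.
    """
    msgs = list(received.values())
    requests = [0] * (t + 2)
    requests[0] = 1 if go_input else 0                           # line 2
    for i in range(1, t + 2):                                    # line 3
        requests[i] = max(m.requests[i - 1] for m in msgs)
    failed_reported = set().union(*(m.failed for m in msgs))     # line 4
    failed = procs - set(received)                               # line 5
    if len(failed) > t:
        raise ValueError("more than t silent processes; outside the model")
    views = list(own.views)          # Views[t] keeps its previous local value
    for i in range(1, t + 1):                                    # line 6
        views[i - 1] = min(m.views[i] for m in msgs) + 1
    horizon = t + 1 - min(len(failed_reported), len(failed))     # line 7
    views[horizon - 1] = 1                                       # line 8
    for i in range(0, t + 1):                                    # line 9
        views[i] = max(views[i], horizon - i)
    # Clean up (assumed placement): force values into their legal ranges.
    requests = [1 if r else 0 for r in requests]
    views = [min(max(v, 0), t + 1) for v in views]
    fire = False
    ready = [i for i in range(views[0], t + 2) if requests[i] == 1]  # line 10
    if ready:
        for j in range(ready[0], t + 2):                         # line 11
            requests[j] = 0
        fire = True                                              # line 12
    return State(requests, failed, views), fire
```

In a failure-free run with t = 1 that starts from a consistent state (Views = [2, 1], no pending requests), a GO input at time 1 makes every process fire at time 3 in this sketch, i.e., after t + 1 rounds.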

4.1 Correctness Proof 

A central notion in the analysis of simultaneous actions under crash
failures is that of a clean round [8]. In the non-stabilizing setting, a
round $r$ is clean according to failure pattern $\mathcal{F}$ if no
process considered non-faulty by all processes at time $r - 1$ is known
to be faulty by one or more (non-crashed) processes at time $r$. In a
setting that allows transient faults, we use a slightly different
definition for the exact same notion. Consider a process $p$ that fails
in round $k$. We say that $p$ fails silently in round $k$ if it is not
blocked according to $\mathcal{F}$ from sending messages in round $k$ to
any of the processes $q \in G^k$. Thus, no process surviving round $k$
can detect $p$'s failure in this round.

Definition 11 (Clean Round). Round $r$ in failure pattern $\mathcal{F}$
is a clean round if (i) no process fails silently in round $r - 1$, and
(ii) all processes (if any) that fail in round $r$ fail silently.

This definition of a round $r$ being clean in $\mathcal{F}$ coincides
with the standard definition of clean rounds previously used in
non-stabilizing systems [8].
In protocols such as Fire-Squad, with the property that every process 
sends the same message to all other processes in every round, all (non- 
crashed) processes receive the same set of messages in a clean round (see 
Lemma 3). 

We start with an overview of the proof, followed by the detailed proof.
Proof outline:

1. Once a clean round has occurred, different processes agree on the value 
of Requests (Lemma 3 and Lemma 4); 

2. Thus, if processes agree on the value of $Views[0]$ they are
guaranteed to act simultaneously, either firing together or, together,
refraining from firing (Lemma 8);

3. Lemma 9 and Lemma 10 show that $Views[0]$ is the same at all
non-crashed processes (once a clean round has occurred);

4. Points 1, 2 and 3 above lead to Lemma 11, stating that once a clean
round occurs, "simultaneity" holds;

5. "liveness" holds by Lemma 12;

6. Lemma 13 and Lemma 14 lead to Lemma 15 which states that "safety"
holds starting from round $\pi(0)$. This, according to the lower bounds,
is optimal;

7. Lemma 16 (together with Lemma 14) shows that Fire-Squad fires by
time $\pi(k)$ given a GO input at time $k$. The lower bound in Theorem 2
implies that this is optimal;

8. Finally, Theorem 3, Theorem 4 and Theorem 5 show that Fire-Squad 
is an SSFS algorithm that optimally stabilizes and is optimally swift. 

Lemma 3. If round $r$ is clean, then the sets $Failed^r$, $Failed'^{r}$,
and the array $Views^r$ are identical for all non-faulty processes.

Proof. In the Fire-Squad algorithm every process sends its Failed set
and Views array to all other processes in every round. If round $r$ is
clean, then all processes receive the same information about the values
of Failed and Views in the system. Thus, the value of Views computed on
Line 6, which depends on the $Views_q$ values received in the current
round, is the same for all $p \in G$. Similarly, the value of Failed'
calculated on Line 4, which depends on the $Failed_q$ sets received, is
the same at all $p \in G$. Finally, in a clean round, all non-faulty
processes receive messages from the same set of processes. As a result,
the value of Failed (computed on Line 5) is the same at all $p \in G$.
Since changes to Failed, Failed' and Views performed on Lines 7-13
depend only on the values of Failed, Failed' and Views, the same changes
are performed by all non-faulty processes. The claim follows. □

Lemma 4. Let $r$ be a clean round, let $0 \le d \le t$ and let
$p, p' \in G^{r+d}$. Then $Requests_p^{r+d}[i] = Requests_{p'}^{r+d}[i]$
holds for all $i$ in the range $d < i \le t$.

Proof. We prove the claim by induction on $d$. The base case is
$d = 0$, in which round $r + d = r$ is a clean round, and all non-faulty
processes receive the same set of messages. Thus, by Line 3, we have
that $Requests_p^r[i] = Requests_{p'}^r[i]$ for all $i$ in the range
$d = 0 < i \le t$. Let $0 < d \le t$, and assume inductively that the
claim holds for $d - 1$. The inductive assumption guarantees that when
the $Requests_q$ arrays are sent in round $r + d$ they agree for all $i$
satisfying $d - 1 < i \le t$. In particular,
$\max_q\{Requests_q[i - 1]\}$ is the same for all $i > d$. Since
$Requests_p[i]$ is set to $\max_q\{Requests_q[i - 1]\}$ on Line 3, it
follows that $Requests_p^{r+d}[i] = Requests_{p'}^{r+d}[i]$ holds for
all $d < i \le t$, as claimed. □

The purpose of Line 7 is to perform our first consistency check,
comparing the reported $Failed_q$ values (from the previous round) to
failures directly observed by $p$ in the current round (stored in
$Failed_p$). We now show that this can matter only at times $k \le 1$.
At all times $k \ge 2$, Line 7 can be viewed as having the simpler form
of setting the horizon to $t + 1 - |Failed'_p|$.

Lemma 5. $Horizon_p^k = t + 1 - |Failed'^{k}_p|$ holds after Line 7 is
executed, for all times $k \ge 2$ and $p \in G^k$.

Proof. If $k \ge 2$ then $k - 1 \ge 1$, and so the values of Failed'
received by $p$ at time $k$ contain only processes that were indeed
faulty by the end of round $k - 1$. Since failure patterns are monotone,
none of these processes sends $p$ a message in round $k$. Hence, by
Line 5 we obtain that $Failed'^{k}_p \subseteq Failed_p^k$. □
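
The horizon computation can be illustrated by a short sketch. It assumes, based on how Line 7 is used in the proof of Lemma 6, that Line 7 sets Horizon to $t + 1$ minus the smaller of $|Failed|$ and $|Failed'|$; the function name is hypothetical.

```python
def horizon(t, failed, failed_prime):
    """Sketch of Line 7 (assumed form): t + 1 - min(|Failed|, |Failed'|)."""
    return t + 1 - min(len(failed), len(failed_prime))


# When Failed' is a subset of Failed, as Lemma 5 establishes for k >= 2,
# the minimum is attained by |Failed'| and the simpler form applies.
t = 3
failed = {"a", "b"}
failed_prime = {"a"}
simple_form = t + 1 - len(failed_prime)
```

Under the subset condition the two forms coincide, which is the content of Lemma 5.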

We denote the first clean round in an execution of Fire-Squad by
$r_c$. By definition, $r_c \ge 1$. We can show:

Lemma 6. If $k \ge 1$, then $Horizon_{p'}^{k+1} \le Horizon_p^k$ for
every $p, p' \in G^{k+1}$. Moreover, for $k \ge \min\{2, r_c\}$,
$Horizon_{p'}^{k+1} \le Horizon_p^k$ for every $p \in G^k$ and
$p' \in G^{k+1}$.

Proof. We start with the second case of the lemma: Let
$k \ge \min\{2, r_c\}$ and let $p \in G^k$, $p' \in G^{k+1}$. In
particular, either $k \ge 2$, or $k = r_c = 1$. We consider each of
these cases separately. Assume that $k \ge 2$, and let $q \in V$ be a
process that updates $Failed_q$ at time $k - 1$. According to Line 5,
$Failed_q$ contains processes that $q$ does not receive messages from
during round $k - 1$. All of these processes do in fact fail no later
than round $k - 1$. Thus, the set Failed' computed by process $p$ at
time $k$ contains only faulty processes. The set $Failed_q$ at time $k$
contains all processes of $Failed_q^{k-1}$. Thus, the set $Failed'_{p'}$
at time $k + 1$ contains all processes from $Failed'_p$. Hence,
$Failed'^{k}_p \subseteq Failed'^{k+1}_{p'}$. Therefore, by Lemma 5,
following Line 7 by $p'$ at time $k + 1$ we have that
$Horizon_{p'}^{k+1} \le Horizon_p^k$.

Now consider the case $k = r_c = 1$. Thus, $p$ and $p'$ receive the same
set of messages during round 1, and compute Failed and Failed' in the
same manner. Thus, $Horizon_p^1 = Horizon_{p'}^1$. Moreover, by Line 4
we have that $Failed'^{2}_{p'} \supseteq Failed_{p'}^1$. It follows that
$\min\{|Failed_p^1|, |Failed'^{1}_p|\} =
\min\{|Failed_{p'}^1|, |Failed'^{1}_{p'}|\} \le |Failed'^{2}_{p'}|$.
By Lemma 5, $Horizon_{p'}^2 = t + 1 - |Failed'^{2}_{p'}|$, hence
$Horizon_{p'}^2 \le t + 1 - \min\{|Failed_p^1|, |Failed'^{1}_p|\} =
Horizon_p^1$. That is, we obtain that
$Horizon_{p'}^{k+1} \le Horizon_p^k$.

To finish the proof, we are left to handle the case when $k = 1$ and
$p, p' \in G^2$. Since $p \in G^2$, by time 2 we have that $p'$ received
$p$'s round 2 messages, implying that
$Failed_p^1 \subseteq Failed'^{2}_{p'}$. Moreover, due to the
monotonicity of crashes, also $Failed_p^1 \subseteq Failed_{p'}^2$.
Therefore,
$\min\{|Failed'^{2}_{p'}|, |Failed_{p'}^2|\} \ge |Failed_p^1| \ge
\min\{|Failed'^{1}_p|, |Failed_p^1|\}$. Hence, by Line 7,
$Horizon_{p'}^2 \le Horizon_p^1$. □

Denote by $minH(\mathcal{F}, k)$ the lowest value of $Horizon^k$, i.e.,
$minH(\mathcal{F}, k) = \min_p\{Horizon_p^k\}$. When $\mathcal{F}$ is
clear from the context, we write $minH(k)$. Notice that minH is the
equivalent of disH with respect to Fire-Squad (recall that disH is
computed according to CCfs).

Lemma 7. Let $k \ge \min\{2, r_c\} + 1$ and let $p \in G^k$. Then, for
all $0 \le i < t$, Line 9 does not change the value of $Views_p^k[i]$.

Proof. Since $k \ge \min\{2, r_c\} + 1$ we have that
$k - 1 \ge \min\{2, r_c\} \ge 1$. At time $k - 1$, for every process $q$
and every $0 \le i \le t$ it holds that
$Views_q^{k-1}[i] \ge Horizon_q^{k-1} - i$, due to Line 9. At time $k$
all processes update Views according to Line 6, thus setting every
entry $i$ (for $i \ne t$) to be $\ge minH(k - 1) - i$. By Lemma 6
(recall that $k - 1 \ge \min\{2, r_c\}$) it holds that
$\max_q\{Horizon_q^k\} \le minH(k - 1)$. Since for every $i \ne t$,
$Views_p[i] \ge minH(k - 1) - i$, it also holds that
$Views_p[i] \ge Horizon_p - i$. Hence,
$\max\{Views_p[i], Horizon_p - i\} = Views_p[i]$. Thus, for all entries
that are not $t$, Line 9 does not change Views. □
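
The three places where Views changes (Lines 6, 8 and 9, as they are quoted in the proofs of Lemma 7 and Lemma 10) can be put together in a small sketch. The value that Line 6 gives to entry $t$ is not stated in this section; the sketch resets it to $t + 1$ purely as a placeholder assumption, and the function name is hypothetical.

```python
def update_views(received_views, horizon, t):
    """Sketch of the Views update at one process in one round.

    received_views: list of Views vectors (t + 1 entries each) received
                    in this round, including the process's own vector.
    horizon:        the Horizon value computed on Line 7 (1 <= horizon <= t + 1).
    """
    views = [0] * (t + 1)
    # Line 6: shift down by one position and age by one time unit.
    for i in range(1, t + 1):
        views[i - 1] = min(v[i] for v in received_views) + 1
    views[t] = t + 1  # placeholder assumption, not taken from the source
    # Line 8: in horizon - 1 rounds, data from one round back is common knowledge.
    views[horizon - 1] = 1
    # Line 9: entry i is never allowed to drop below horizon - i.
    for i in range(t + 1):
        views[i] = max(views[i], horizon - i)
    return views
```

In a run where the entries produced by Line 6 already satisfy $Views[i] \ge Horizon - i$, the final loop leaves them unchanged, which is the situation Lemma 7 describes.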

Observation 1. For every $k \ge 1$, it holds that
$\delta(k - 1) \le |Failed_p^k| \le \delta(k)$. In a similar manner,
$\delta(k - 1) \le |Failed'^{k+1}_p| \le \delta(k)$.

Lemma 8. Let $k \ge r_c$ and let $p, p' \in G^k$. If
$Views_p^k[0] = Views_{p'}^k[0]$ then $p$ and $p'$ have the same
external output at time $k$ (i.e., they either both fire or they both
do not fire at time $k$).

Proof. Consider the value of $Views_p^k[0]$. Let $k' \le k$ be the
maximal time at which $Views_p[k - k']$ was updated due to Line 8.
Notice that $Views_p^{k'}[k - k'] = 1$, and by the update in Line 6 it
holds that $Views_p^k[0] \ge k - k' + 1$. Moreover,
$Horizon_p^{k'} = k - k' + 1$, i.e.,
$t + 1 - |Failed'^{k'}_p| = k - k' + 1$.

Between time $k' - 1$ and time $k$ there are $k - k' + 1$ rounds. From
the above discussion, at time $k' - 1$ there were at least $t + k' - k$
failed processes. Thus, between time $k' - 1$ and time $k$ there was
some clean round. Denote this clean round by $r$.

By Lemma 4, for every $i$, $k - r < i \le t$, it holds that
$Requests_p^k[i] = Requests_{p'}^k[i]$. Since
$Views_p^k[0] \ge k - k' + 1 \ge k - r + 1$, we have that for every
$i$, $Views_p^k[0] \le i \le t$, it holds that
$Requests_p^k[i] = Requests_{p'}^k[i]$. Thus, $p$ and $p'$ either both
pass the condition of Line 10 or they both do not pass it. It follows
that either $p, p'$ both fire, or they both do not fire. □
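
The firing decision that Lemma 8 reasons about can be sketched as follows. The sketch assumes, following the way Lines 10 and 11 are described in the proofs of Lemma 8 and Lemma 15, that a process fires when some entry of Requests at index at least $Views[0]$ equals 1 and then zeroes those entries; the function name is hypothetical.

```python
def fire_step(requests, views0):
    """Sketch of Lines 10-11 (assumed form).

    requests: the process's Requests vector.
    views0:   the process's current Views[0] value.
    Returns (fired, new_requests).
    """
    tail = range(views0, len(requests))
    # Line 10: is there a pending request old enough to be common knowledge?
    fired = any(requests[i] == 1 for i in tail)
    new_requests = list(requests)
    if fired:
        # Line 11: the requests that triggered the firing are fulfilled.
        for i in tail:
            new_requests[i] = 0
    return fired, new_requests
```

Two processes that hold the same $Views[0]$ and agree on all Requests entries from that index onward necessarily return the same `fired` value, which is the argument of the lemma.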

Lemma 9. For every $k \ge \min\{2, r_c\}$ and $p \in G^k$, if
$minH(k) = 1$ then $Views_p^k[0] = 1$.

Proof. If $k = r_c$ then by Lemma 3 every process $p$ has
$Horizon_p = minH(k)$. Therefore, if $minH(k) = 1$ then by Line 8, $p$
sets $Views_p^k[0] = 1$.

Continue with the case that $k \ne r_c$, i.e., $k \ge 2$. If
$minH(k) = 1$, then some process $q$ has $Horizon_q^k = 1$. Thus $q$ has
$|Failed'^{k}_q| = t$. Notice that $Failed'_q$ contains processes that
were faulty during round $k - 1$. Therefore,
$Failed_q^k = Failed'^{k}_q$, which leads to the conclusion that all
$Failed^{k-1}$ sets received by $q$ and used in the construction of
Failed' were received from non-faulty processes. Thus, all processes
receive these sets, and process $p$ also has $|Failed'^{k}_p| = t$,
leading to $Horizon_p^k = 1$. Thus, by Line 8, $p$ has
$Views_p^k[0] = 1$. □



Lemma 10. Let $r \ge r_c$ and $p, p' \in G^r$. If $minH(r) > 1$ then
$Views_p[i] = Views_{p'}[i]$ holds for all $0 \le i \le minH(r) - 1$.

Proof. The proof is by induction on $r \ge r_c$. For $r = r_c$, we have
by Lemma 3 that $Views_p = Views_{p'}$, and the claim immediately
follows. For the inductive step, assume that $r > r_c$ and that the
claim holds for $r - 1$. We consider two cases. First assume that
$minH(r) = minH(r - 1)$. In this case, no process failure is discovered
in round $r$. Thus, round $r$ is clean, and the claim follows by
Lemma 3 as in the base case.

Next, assume that $minH(r) < minH(r - 1)$. The $Views_p[i]$ values can
change only on Line 6, Line 8, and Line 9. First consider the change by
Line 6. In this case, $Views[i - 1]$ is set to
$\min_q\{Views_q[i]\} + 1$ for $1 \le i \le t$. By the inductive
assumption we have that $Views_p[i] = Views_{p'}[i]$ holds for all
$0 \le i \le minH(r - 1) - 1$ before Line 6 is applied. Since the values
of $Views[j]$ before Line 6 are shifted down by one, and become the
values of $Views[j - 1]$ after it is applied, we obtain that
$Views_p[i] = Views_{p'}[i]$ for all $0 \le i \le minH(r - 1) - 2$ once
Line 6 has completed. Since $minH(r) < minH(r - 1)$, we have that
$minH(r) - 1 \le minH(r - 1) - 2$. Consequently,
$Views_p[i] = Views_{p'}[i]$ for all $0 \le i \le minH(r) - 1$ when
Line 7 is reached.

On Line 8, $Views_p[Horizon_p^r - 1]$ is set to 1. By definition,
$minH(r) \le Horizon_p^r$, so the update does not affect values
$Views[i]$ for $i < minH(r) - 1$. Hence, the fact that
$Views_p[i] = Views_{p'}[i]$ for all $0 \le i \le minH(r) - 1$, which
was shown above to hold when Line 7 is reached, also holds when Line 9
is reached.

By Lemma 7, since $r - 1 \ge r_c$, Line 9 does not change the value of
$Views_p[i]$, for all $0 \le i < t$. Since $minH(r) - 1 < t$ we have
that after Line 9, $Views_p[i] = Views_{p'}[i]$ for all
$0 \le i \le minH(r) - 1$. □

Lemma 11. "simultaneity" holds for all times $k \ge r_c$.

Proof. By Lemma 9 and Lemma 10, for two processes $p, p'$ it holds that
$Views_p[0] = Views_{p'}[0]$. Together with Lemma 8 we have that
$p, p'$ fire together or do not fire together, for every $r \ge r_c$. □

Lemma 12. "liveness" holds for all times $k \ge 0$.

Proof. If some non-faulty process $p$ received a request to fire at
time $k$, then it sets $Requests_p^k[0] = 1$. Since
$Views_p[0] \ge Horizon_p^k \ge 1$, $p$ will not update
$Requests_p[0] = 0$ due to Line 11. Thus, at time $k + 1$ it holds that
$Requests_p^{k+1}[1] = 1$; and in general, if by time $k + i$ $p$ does
not set $Requests_p^{k+i}[i] = 0$ then it holds that
$Requests_p^{k+i+1}[i + 1] = 1$.

Notice that if $p$ sets $Requests_p^{k+i}[i] = 0$ (for $i \ge 1$), then
$p$ executes Line 11, indicating that $p$ fires. Notice that
$Views_p^{k+t+1}[0] \le Horizon_p^{k+t+1} \le t + 1$. Thus, if by time
$k + t + 1$ $p$ has not set $Requests_p^{k+t+1}[t + 1] = 0$, then at
time $k + t + 1$ $p$ will fire.

We conclude that within $t + 1$ rounds $p$ will fire, and "liveness"
holds. □

Lemma 13. Let $k \ge 1$, and let $p \in V$. If $k'$ is such that
$p \in G^{k'}$ and $k' = k + Horizon_p^k - 1$, then
$Views_p^{k'}[0] \le Horizon_p^k$.

Proof. Let $p$ be any process and consider time $k$: by Line 8 process
$p$ sets $Views_p^k[Horizon_p^k - 1] = 1$. For $Horizon_p^k = 1$, it
holds that $Views_p^k[Horizon_p^k - 1] = Views_p^k[0] = 1 \le
Horizon_p^k$.

The rest of the proof concentrates on the case that $Horizon_p^k > 1$.
At time $k' = k + 1$, if $p$ does not update
$Views_p^{k+1}[Horizon_p^k - 2]$ due to Line 8, it holds that
$Views_p^{k+1}[Horizon_p^k - 2] \le 2$; and in general, if at time
$k' = k + j$ $p$ does not update $Views_p^{k+j}[Horizon_p^k - 1 - j]$
then $Views_p^{k+j}[Horizon_p^k - 1 - j] \le 1 + j$. Notice that if $p$
does update $Views_p[Horizon_p^k - 1 - j]$ due to Line 8 then $p$ has
$Views_p[Horizon_p^k - 1 - j] = 1 \le 1 + j$.

Thus, at time $k' = k + Horizon_p^k - 1$ it holds that
$Views_p^{k'}[0] \le Horizon_p^k$. □

Define $minHG(\mathcal{F}, k) = \min_{p \in G} Horizon_p^k$ and use it
to define
$bestH(\mathcal{F}, k) = \min_{k' \ge k}\{k' + minHG(\mathcal{F}, k' + 1)\}$.
If $\mathcal{F}$ is clear from the context, we use $bestH(k)$.

Notice that minHG is similar to minH except that minHG considers only
Horizon values of processes that never crash, while minH considers
processes that haven't crashed yet. Also, notice that bestH is the
equivalent of $\pi$ with respect to Fire-Squad (recall that $\pi$ is
computed according to CCfs).
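
As a worked illustration of the bestH definition, the following sketch evaluates the minimum over a finite window. It assumes that every minHG value lies between 1 and $t + 1$; under that assumption the candidate at $k' = k$ is at most $k + t + 1$, while any $k' > k + t$ yields at least $k + t + 2$, so only $k' \in \{k, \ldots, k + t\}$ needs to be examined. The function name and the callable `min_hg` are hypothetical.

```python
def best_h(k, min_hg, t):
    """Sketch: bestH(k) = min over k' >= k of k' + minHG(k' + 1).

    min_hg: callable returning minHG at a given time; assumed to take
            values in the range 1 .. t + 1.
    """
    return min(kp + min_hg(kp + 1) for kp in range(k, k + t + 1))


# Example with t = 2: the horizon of the correct processes drops from 3
# to 1 at time 2 (illustrative values, not taken from the paper).
values = {1: 3, 2: 1, 3: 1}
example = best_h(0, lambda j: values.get(j, 1), 2)
```

In the example the minimum is attained at $k' = 1$, giving $1 + minHG(2) = 2$, so a discovered failure burst makes bestH smaller than the worst case $t + 1$.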

Lemma 14. $bestH(k) \le \pi(k)$, for every $k \ge 0$.

Proof. Consider the value of $\pi(k) = \min_{k' \ge k}\{absH(k')\}$, and
denote by $k''$ the latest time for which the minimum is reached. I.e.,
$\pi(k) = absH(k'') = k'' + t + 1 - \delta(k'')$, and for all
$k' > k''$ it holds that $absH(k') > \pi(k)$. Thus,
$\delta(k'' + 1) = \delta(k'')$ (otherwise,
$absH(k'' + 1) \le absH(k'')$, contradicting the choice of $k''$).

Since $\delta(k'' + 1) = \delta(k'')$ it holds that no new failed
processes are discovered at round $k'' + 1$. Consider two options,
$k'' \ge 1$ and $k'' = 0$. When $k'' \ge 1$ it follows that
$k'' + 1 \ge 2$ and therefore every non-faulty process $p$ at time
$k'' + 1$ has $Horizon_p^{k''+1} = t + 1 - \delta(k'')$. Thus,
$minHG(k'' + 1) = t + 1 - \delta(k'')$, leading to
$k'' + minHG(k'' + 1) = k'' + disH(k'') = absH(k'')$.

Consider the case that $k'' = 0$. By Definition 8,
$\delta(k'') = \delta(0) = 0$, leading to $disH(k'') = t + 1$. Since
$Horizon_p^1 \le t + 1$ it follows that
$k'' + minHG(k'' + 1) \le k'' + disH(k'') = absH(k'')$.

For both $k'' \ge 1$ and $k'' = 0$ we conclude that
$k'' + minHG(k'' + 1) \le absH(k'')$. Since $\pi(k) = absH(k'')$ we
conclude that $bestH(k) \le \pi(k)$. □

Lemma 15. "safety" holds at all times $k \ge \pi(0)$.

Proof. Let $k'$ be a time satisfying
$k' + minHG(k' + 1) = bestH(0)$, and let $p \in G$ be a process such
that $Horizon_p^{k'+1} = minHG(k' + 1)$. By Lemma 6, for all
$k'' \ge k' + 1$ it holds that $Horizon_p^{k''} \le Horizon_p^{k'+1}$
(notice that $k' + 1 \ge 1$, and $p \in G^{k''}$).

Consider time $k' + i$ (for $i \ge 1$); by Lemma 13, for every time
$k'' = k' + i + Horizon_p^{k'+i} - 1$ it holds that
$Views_p^{k''}[0] \le Horizon_p^{k'+i} \le Horizon_p^{k'+1}$.

Thus, for every time $k'' \ge k' + Horizon_p^{k'+1} = bestH(0)$ it
holds that $Views_p^{k''}[0] \le Horizon_p^{k'+1} \le bestH(0)$.

By Lemma 14 we have that $bestH(0) \le \pi(0)$. Hence, for every time
$k'' \ge \pi(0)$ it holds that $Views_p^{k''}[0] \le \pi(0)$. Consider
time $\pi(0)$. Since $r_c \le \pi(0)$, by Lemma 11, "simultaneity"
holds. Therefore, if some process fires then all processes in
$G^{\pi(0)}$ fire. Consider any process $q \in G^{\pi(0)}$. If $q$
fires, then it sets $Requests_q^{\pi(0)}[i] = 0$ for all
$i \ge Views_q^{\pi(0)}[0]$. If $q$ does not fire, then it is because
$Requests_q^{\pi(0)}[i] = 0$ for all $i \ge Views_q^{\pi(0)}[0]$.
Moreover, since $Views_q^{\pi(0)}[0] \le \pi(0)$, it holds that
$Requests_q^{\pi(0)}[i] = 0$ for all $i \ge \pi(0)$.

Since for every $k'' \ge \pi(0)$ it holds that
$Views_p^{k''}[0] \le \pi(0)$, we have that if process $p$ has
$Requests_p^{k''}[i] = 1$, it must have been set at some time $\ge 0$.
In other words, if a fire action occurs then there was a previous GO
input received; and because $Requests_p[i]$ is zeroed once a fire action
occurs, each GO can induce at most a single fire action. Thus, the
number of times $bestH(0) \le k' \le k$ for which a fire action occurs
is not larger than the number of times $0 \le k' \le k$ during which a
GO input is received. □

Lemma 16. Let input $\mathcal{I}$ be sequential with respect to
(Fire-Squad, $S$, $\mathcal{F}$). If $\mathcal{I}_p^k = 1$ for process
$p$ at time $k$ then $O^{k'} = 1$ for some $k < k' \le bestH(k)$.

Proof. Since $\mathcal{I}$ is sequential and $\mathcal{I}_p^k = 1$ it
holds that $p \in G$. Consider
$bestH(0) = \min_i\{i + minHG(i + 1)\}$, and denote by $k'$ a time that
satisfies $k' + minHG(k' + 1) = bestH(0)$. Let $q \in G$ be some
process such that $Horizon_q^{k'+1} = minHG(k' + 1)$. Since
$k' + 1 \ge 1$ and $q \in G$, by Lemma 13, at time
$k'' = k' + Horizon_q^{k'+1} = bestH(0)$ it holds that
$Views_q^{k''}[0] \le Horizon_q^{k'+1}$.

If $p$ fires at some time $k < k'' < bestH(k)$ then the claim is
proved. Otherwise, at time $k'' = bestH(k)$ it holds that
$Views_q^{k''}[0] \le Horizon_q^{k'+1}$. Since $\mathcal{I}_p^k = 1$ and
$p \in G$, by time $k + 1$ we have that $Requests_q^{k+1}[1] = 1$.
Since $p$ does not fire before time $bestH(k)$ and since "simultaneity"
holds, we have that by time $k'' = bestH(k)$ it holds that
$Requests_q^{k''}[Horizon_q^{k'+1}] = 1$. Therefore, at time
$k'' = bestH(k)$ $q$ will fire and due to "simultaneity" $p$ will fire
as well. We conclude that for some time $k''$, satisfying
$k < k'' \le bestH(k)$, we have that $O^{k''} = 1$. □

Theorem 3. Fire-Squad solves the SSFS problem, it optimally stabilizes
and is optimally swift.

Proof. Consider any initial state $S$, any input path $\mathcal{I}$ and
any failure pattern $\mathcal{F}$. By definition,
$\pi(\mathcal{F}, 0) \le t + 1$. Thus, by Lemma 15, "safety" holds
starting from time $t + 1$. Since by time $t + 1$ there is a clean
round, by Lemma 11, "simultaneity" holds starting from time $t + 1$.
Lemma 12 finishes the proof, and we have that for time $k = t + 1$ it
holds that $stab(\text{Fire-Squad}, S, \mathcal{I}, \mathcal{F}) \le k$. □

Theorem 4. Fire-Squad optimally stabilizes.

Proof. By Lemma 15, the "safety" property of Fire-Squad holds from
time $\pi(\mathcal{F}, 0)$. Moreover, by Lemma 11 together with the
fact that by time $\pi(\mathcal{F}, 0)$ there is a clean round, the
"simultaneity" property of Fire-Squad holds from time
$\pi(\mathcal{F}, 0)$. Combined with Lemma 12 we have that
$stab(\text{Fire-Squad}, S, \mathcal{I}, \mathcal{F}) \le
\pi(\mathcal{F}, 0)$, for any state $S$, input path $\mathcal{I}$ and
failure pattern $\mathcal{F}$. I.e.,
$\max_{S,\mathcal{I}}\{stab(\text{Fire-Squad}, S, \mathcal{I},
\mathcal{F})\} \le \pi(\mathcal{F}, 0)$.

Let $A$ be any SSFS algorithm. By Theorem 1, for every failure pattern
$\mathcal{F}$ we have that
$\max_{S,\mathcal{I}}\{stab(A, S, \mathcal{I}, \mathcal{F})\} \ge
\pi(\mathcal{F}, 0)$. Thus, for every $\mathcal{F}$:
$\max_{S,\mathcal{I}}\{stab(\text{Fire-Squad}, S, \mathcal{I},
\mathcal{F})\} \le
\max_{S,\mathcal{I}}\{stab(A, S, \mathcal{I}, \mathcal{F})\}$. □

Theorem 5. Fire-Squad is optimally swift.

Proof. Let input $\mathcal{I}$ be sequential with respect to
(Fire-Squad, $S_{\text{Fire-Squad}}$, $\mathcal{F}$). By Lemma 16, if
$\mathcal{I}_p^k = 1$ for some process $p$ at time $k$ then for some
$k'$ satisfying $k < k' \le bestH(k)$ it holds that $O^{k'} = 1$.
Therefore, by time $bestH(k)$ we have that
$\#[(\text{Fire-Squad}, S_{\text{Fire-Squad}}, \mathcal{I},
\mathcal{F}), bestH(k)]$ is no smaller than the number of GO inputs
received by time $k$.

Let $A$ be any SSFS algorithm and $\mathcal{I}$ sequential with respect
to ($A$, $S_A$, $\mathcal{F}$). By Theorem 2, for every $k \ge 0$ for
which a GO input is received in $\mathcal{I}^k$ there is no fire action
in $O = A(S_A, \mathcal{I}, \mathcal{F})$ during times $k'$ satisfying
$k < k' < \pi(\mathcal{F}, k)$. Since $bestH(k) \le \pi(k)$ (Lemma 14),
it holds that by time $bestH(k)$, the value of
$\#[(A, S_A, \mathcal{I}, \mathcal{F}), bestH(k)]$ is at most equal to
the number of GO inputs received by time $k$.

Thus, for every $A$, $S_A$, $S_{\text{Fire-Squad}}$, $\mathcal{F}$ and
sequential $\mathcal{I}$ it holds that
$\#[(\text{Fire-Squad}, S_{\text{Fire-Squad}}, \mathcal{I},
\mathcal{F}), k] \ge \#[(A, S_A, \mathcal{I}, \mathcal{F}), k]$, for
all $k$. □

5 Conclusions and Open Problems 

This paper presents Fire-Squad, the first self-stabilizing firing squad
algorithm. Fire-Squad is optimal in two important respects: It optimally
stabilizes, and is optimally swift. There are many directions in which this
work can be extended. These include:

- Fire-Squad assumes the crash fault model. What can be said about 
the omission fault model? And what about the Byzantine fault model? 
Each such extension seems to be a nontrivial step. 

- Fire-Squad works when we assume that failures are permanent. Be- 
ing an ongoing and everlasting service, firing squad is expected to 
operate for long periods, in which processes may recover. A more rea- 
sonable assumption in this case is that there is a bound (of t) on 
the number of failures over every interval of m rounds, for some m. 
(Non-stabilizing) Continuous consensus has recently been studied in 
this model [14], and it would be interesting to see if the same can be 
done for self-stabilizing firing squad. 